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Abstract —One of the key challenges in applying reinforce¬ 
ment learning to complex robotic control tasks is the need to 
gather large amounts of experience in order to find an effective 
policy for the task at hand. Model-based reinforcement learning 
can achieve good sample efficiency, but requires the ability to 
learn a model of the dynamics that is good enough to learn an 
effective policy. In this work, we develop a model-based rein¬ 
forcement learning algorithm that combines prior knowledge 
from previous tasks with online adaptation of the dynamics 
model. These two ingredients enable highly sample-efficient 
learning even in regimes where estimating the true dynamics 
is very difficult, since the online model adaptation allows the 
method to locally compensate for unmodeled variation in the 
dynamics. We encode the prior experience into a neural network 
dynamics model, adapt it online by progressively refitting a local 
linear model of the dynamics, and use model predictive control 
to plan under these dynamics. Our experimental results show 
that this approach can be used to solve a variety of complex 
robotic manipulation tasks in just a single attempt, using prior 
data from other manipulation behaviors. 

1. Introduction 

One of the remarkable features of human and animal motor 
control is the ability to quickly adapt to new situations. 
When a child is asked, for example, to stack two unfamiliar 
Lego blocks, he or she might play with the objects and 
take a small amount of time to learn about their dynam¬ 
ics, but will typically succeed at the task very quickly. 
This is sometimes referred to as one-shot learning, where 
an agent must successfully perform a task given one, or 
very few attempts. Reinforcement learning (RL) provides 
a computational framework for robots to learn new motor 
skills, but typical applications of RL focus more on mastery 
than one-shot learning, and require a substantial number of 
training episodes [1]. Model-based RL methods reduce the 
required interaction time by acquiring a model of the system 
dynamics, and using this model to discover an effective 
policy. However, model-based RL algorithms that operate in 
this way must be equipped with a model that can represent 
a good approximation to the dynamics, and they must be 
provided with sufficient experience to optimize this model to 
produce accurate dynamics predictions [2]. A single attempt 
at the task is often insufficient to obtain such an accurate 
model, and while these challenges can be mitigated by 
incorporating domain knowledge about the system [3], [4] or 
demonstrations [5], they make the development of a general- 
purpose one-shot learning method difficult. 

We propose to address the first challenge by developing a 
method that can use a coarse model of the system dynamics 
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Fig. 1: Diagram of our method: the robot uses prior experience 
from other tasks (a) to fit a neural network model of the dynamics 
of object interaction tasks (b). When faced with a new task (c), 
our algorithm learns a new model online during task execution, 
using the neural network as a prior. This new model is used to plan 
actions, allowing for one-shot learning of new skills. 

that is adapted online to better match the most recent expe¬ 
rience. This frees the method from needing a representation 
that can capture global dynamics with high accuracy, and 
requires only the ability to accurately represent the dynamics 
locally, which we show can be done with a simple linear 
model. The accuracy of the global model still affects the 
proficiency with which a new task can be performed, since 
more accurate models require less adaptation, but even an 
inaccurate initial model is sufficient for our method to per¬ 
form a variety of manipulation tasks on the first attempt. We 
show that even a simple linear global model can allow this 
framework to perform a variety of manipulation tasks with 
nonlinear dynamics, while a more sophisticated nonlinear 
model based on neural networks greatly improves the success 
rate on more complex tasks. Online adaptation also offers 
us a way to address the second challenge: since the global 
model does not need to be very accurate, it can be estimated 
using data from tasks that are different from the one being 
attempted. This enables our approach to perform one-shot 
learning of new tasks by using the robot’s prior experience 
on other tasks, without requiring explicit domain knowledge 
or demonstrations to be provided by the designer. A diagram 
of our method is shown in Figure We demonstrate our 
approach by learning a variety of challenging, contact-rich 
manipulation behaviors on a PR2 robot. All of our motion 
skills involve low-level torque control of the robot’s motors, 
and high dimensional state, corresponding to joint angles, 





joint velocities, and end-effector pose. 


III. Background 


II. Related Work 

Reinforcement learning techniques have been applied to a 
range of robotic control problems [2]. Such methods can be 
broadly categorized as model-free methods, which directly 
learn a control policy from system interaction [6], and model- 
based methods, which first learn a model of the system 
dynamics, and then optimize the policy under this model 
[1]. Model-based methods can be substantially more sample- 
efficient and typically achieve the fastest learning times, but 
they require a model representation that can be used to 
learn an accurate estimate of the true dynamics. Typically, in 
order to learn the model from a small amount of interaction, 
they make various assumptions, such as smoothness [7], 
[8] or access to prior knowledge about the system [3], [4], 
since the general problem of learning a complex, nonlinear 
function from a small amount of data is difficult. Contact-rich 
robotic manipulation skills present an especially challenging 
model learning problem, and simple assumptions, such as 
smoothness and prior knowledge, are often insufficient to 
acquire a model that is accurate enough for manipulation. 

In this paper, we instead specifically seek to learn new 
behaviors with a coarse model of the system dynamics 
that does not adequately capture all of the intricacies of 
the real system, and locally adapt this model online based 
on the most recent experience. In this way, our approach 
resembles adaptive control [9]. However, unlike standard 
adaptive control, we incorporate an expressive (but poten¬ 
tially inaccurate) prior model of the dynamics, constructed 
from previous experience on other manipulation tasks, and 
use only high-level objectives, such as the desired position 
of a target object, rather than simple trajectory tracking. This 
makes our method suitable for complex robotic manipulation 
skills with only high-level specification in the form of a 
cost function. Experience-based priors have previously been 
suggested in reinforcement learning [10], though typically 
in the context of accelerating iterative model-free learning. 
In contrast, our work demonstrates one-shot learning, where 
the robot can immediately perform new tasks by leveraging 
prior experience. 

In order to choose the actions under our locally adapted 
model, we use model predictive control (MFC) based on 
differential dynamic programming (DDF) [11]. Neural net¬ 
work models have recently been combined with MFC for 
planar task-space control of cutting tasks [12]. While this 
approach is similar in spirit to the one presented in this 
work, we tackle a wider range of robotic manipulation tasks 
using direct, low-level torque control of the entire 7 degree 
of freedom arm. Since the tasks that we tackle are quite 
different from the prior experience used to train the neural 
network, local adaptation is required for success, as shown 
in our experimental results. Outside of robotic manipulation, 
MFC has been combined with online and offline adaptation 
[13], [14], [15], but typically in the context of trajectory 
tracking, rather than learning new skills with high-level 
specification. 


In reinforcement learning, as well as in optimal con¬ 
trol, the aim is to control a dynamical system given by 
xt+i = /(xt, Ut) by choosing the actions Ut to minimize the 
total cost where T is the time horizon. We 

consider fixed-horizon episodic tasks in this paper, though 
our method can also be extended to an infinite horizon formu¬ 
lation. When the system dynamics /(xt,Ut) are unknown, 
we can construct an estimate of the dynamics /(xt, Ut), for 
example by fitting a parameterized function approximator 
to data from prior system interactions, and then optimize a 
sequence of actions under the estimated dynamics /(x^, Ut). 

We will make use of the iterative linear quadratic regulator 
(iLQR) algorithm [16] to optimize the actions with respect to 
the cost under an estimated dynamics model. This algorithm 
can be viewed as an application of the Gauss-Newton method 
for trajectory optimization, and requires iteratively lineariz¬ 
ing the dynamics around the current nominal trajectory, 
denoted f = {xi, ui,..., xt, ut}, constructing a quadratic 
approximation to the cost, computing the optimal actions 
with respect to this approximation of the dynamics and 
cost, and running the resulting actions forward to obtain 
a new nominal trajectory. In the derivation below, we will 
assume that the linearized stochastic dynamics are given by 
p(xt+i|x(, Ut) = A/'(/xtX(+/utUt+/ct, F(), where F* is the 
covariance of the Gaussian dynamics noise, and the quadratic 
approximation to the cost consists of a linear term ^xut 
and a quadratic term ^xu,xut, where the subscripts denote 
differentiation with respect to the vector [x^; u^]. When the 
dynamics are linear and the cost is quadratic, the Q-function 
and the value function are both quadratic, and given by 

V (X() = ix^T4,xtXt + x^Fxt + const 

Q(xt,ut) = i[xt;U(]'^Qxu.xut[xt;U(] + [x(;ut]'^Qxut+const 

We can express them with the following recurrence: 

Qxu,xut ~ ^xu,xut 3“ T/xut^^,xt+l/xut 
Qxut — ■^xut + 7/xut^t+l 
^x,xt — Qx,xt ~ QU,XtQu,utQu,xt 
^Xt = Qxt ~ Qu,XtQu,UtQut^ 

which allows us to compute the optimal control law as 
g{xt) = ut + kt + Kt{xt - xt), where K* = -Qu,ut*5u,xt 
and k( = —QuutQut- Performing a forward rollout using 
this control law allows us to find a new nominal trajectory, 
and the backward dynamic programming pass is repeated 
around this trajectory to take the next Gauss-Newton step. 
The discount factor 7 allows us to reduce the weight on later 
states. While DDF is typically used with 7 = 1, we found 
that we could obtain better results under uncertain, estimated 
dynamics with 7 = 0.95. Intuitively, this reflects the fact 
that predictions further into the future are less likely to 
correspond to reality under uncertain dynamics, so it makes 
sense not to weight them as highly. 




Iterative LQR can be used to optimize a trajectory offline 
under known system dynamics, or even with dynamics 
estimated from previous executions on a physical system 
[17]. However, in this work we instead use iterative LQR 
to perform model-predictive control, where the policy is 
recomputed in real time at each time step to update the next 
action in response to the current state. Iterative LQR is well 
suited for MFC because it is fast, can be effectively used 
with short horizons, and readily allows warm-starting with 
the previous solution [18]. 

However, using iterative LQR for MFC still requires us 
to obtain an estimate of the linearized system dynamics, 
given by /xt and fut- The dynamics can be estimated 
from a simulator of the physical system, but this requires 
considerable knowledge about the system being controlled. 
In the case of robotic manipulation, it is reasonable to expect 
to obtain a good model of the robot, but we often lack a 
model of the objects that the robot is interacting with, so 
being able to handle unknown dynamics is highly desirable. 
In Section |n| we discussed how previous methods have 
approached this problem by using a variety of physical and 
statistical estimates. In this work, we take a different ap¬ 
proach, and do not attempt to accurately learn a global model 
of the dynamics. Instead, we assume that we have access to 
only a weak prior model of the form p(xt, Ut, x^+i), and 
estimate a local linear model of the dynamics under this 
prior. The local linear model is updated at each time step, 
which allows our method to gradually adapt to changing 
dynamics conditions and compensate for inaccuracies in the 
prior model. 

IV. Model-Based Reinforcement Learning with 
Online Dynamics Adaptation 

Our model-based reinforcement learning approach with 
online dynamics adaptation consists of using MFC (with 
iterative LQR) to repeatedly update the robot’s policy under 
an evolving dynamics model. The dynamics model is linear 
time-varying, but is recomputed at each time step based 
on the recently observed states and actions, as well as the 
dynamics prior, which can take on one of a number of 
different representations and is trained on previous robot 
experience, which might even be from a different task. In 
this section, we describe how linear dynamics can be fitted 
under a dynamics prior, as well as our scheme for updating 
the dynamics online based on the robot’s recent experience. 
We then summarize our model-based reinforcement learning 
algorithm. 

A. Fitting Dynamics with Priors 

In order to fit a linear model of the dynamics to a set 
of N samples {xi,Ui,x'}, we can simply use standard 
linear regression to determine /x, /u, and /c. To make 
it more convenient to incorporate prior information, we 
will first reformulate this linear regression fit and view it 
as fitting a Gaussian model to the dataset {x, u, x'}, and 
then conditioning this Gaussian to obtain p(xt+i |xt, Ut). 
While this is equivalent to linear regression, it allows us 


to easily incorporate a normal inverse-Wishart prior on this 
Gaussian in order to bring in prior information. Let £ be 
the empirical covariance of our dataset, and let (i be the 
empirical mean. The normal-inverse-Wishart prior is defined 
by prior parameters jiQ, m, and no. Under this prior, the 
maximum a posteriori estimates for the covariance S and 
mean /i are given by 


N F no 

m/io + nojl 

F = -^-• 

m + no 


( 1 ) 


Having obtained H and fi, we can obtain an estimate of 
the dynamics p(xt+i |xt, Ut) by conditioning the distribution 
A/’(/i, S) on [xt; Ut], which produces linear-Gaussian dynam¬ 
ics p(xt+i |xt, Ut) = + /utUt + /ct, Ft), given by 


/xu ^[xu,xu] 

fc — M[x'] ~ /xuM[xu] 

F = S[x',x'] - /xuS[xu,xu]/Ju (2) 


where the bracketed subscripts denote the submatrix corre¬ 
sponding to the specified elements. The parameters of the 
normal-inverse-Wishart prior are obtained from some prior 
model of the dynamics. We discuss how a variety of global 
dynamics model, such as Gaussian mixture models and 
neural networks, can be used to produce the prior parameters 
/io, ni, and no in Section [V| 


B. Online Estimation of Locally Linear Dynamics 

In the batch setting, such as the one described in prior 
work [17], the empirical mean /} and covariance S are 
estimated from a batch of previously collected data. In this 
work, we instead would like to update the model online 
during execution. To do this efficiently, we use an online 
estimate for jl and S. The mean can be estimated simply by 
using the following update: 


Ft ^ ^Ft-i + (1 “ /^)Pt9 (3) 

where pt = [x^-i; u^-i; x^] is the observation and [3 
is a discounting factor that causes the model to forget old 
data. For the covariance, first observe that, in the batch case, 
S = ^ PiP? — We can therefore estimate St 
as St = ~ AtA?’ where At is the current estimate of 

\ Pt'P?’ which we update according to: 

At ^ /?At_i + (1 - /3)ptpJ- (4) 


These updates allow us to estimate the empirical mean At 
and covariance St at the t* step using recently observed state 
transitions. The posterior mean and covariance S and /i can 
then be recovered using the prior and the method described 
in the previous section. We initialize Ao and Aq by fitting 
a single Gaussian to the data used for training the prior, in 
order to start with a reasonable estimate of the dynamics. 

The value of p determines how quickly the algorithm “for¬ 
gets” past experiences. When the true dynamics are nonlin¬ 
ear, as in most robotic manipulation tasks, we should forget 





Algorithm 1 Model-based reinforcement learning with on¬ 
line adaptation 

1: for time step t = 1 to T do 
2: Observe state xt 

3: Update fit and via Equations and 0 

4: Compute — At A? 

5: Evaluate prior to obtain fio, m, and no (see 

Section |V]) 

6: Update A and A as described in Equation (g) 

7: Compute ji and E via Equation 0 

8: Compute /xt, fut^ fct^ and from /i and E via 

Equation 0 

9: Run LQR to compute K^, k^, and Qu,ut 

10: Sample U( from V(u( + kj + Kt(xt - X(), Qu,ut) 

11: Take action ut 

12: end for 


past experiences more quickly when the state is changing 
rapidly. However, when the robot becomes stuck in a difficult 
situation that requires a more accurate dynamics estimate, it 
must incorporate more data, which requires a larger value of 
A- The effective sample size N is also important, since it 
controls the relative strength of the prior. The true effective 
sample size is in fact inversely proportional to 1-/3. We 
adaptively adjust both A and N based on the relative accuracy 
of the empirical and prior dynamics estimates. The intuition 
is that, when the prior is less effective, we should weight the 
empirical estimate more highly. We should also raise A, since 
the empirical estimate factors more highly in the dynamics, 
and therefore must use more of the past data to improve 
its accuracy. This intuition is also supported by the inverse 
relationship between 1 — A and N. 

Specifically, we can form the prediction for the most 
recently observed transition from (xt_i,Ut-i) to x^ using 
both the empirically estimated parameters and At and 
the prior parameters ^^t and /iot- We condition both 
and on (xt_i,ut_i) and predict 

the most probable value for the current state, which we 
denote Xt for the prediction from the empirical parameters 
and Xt for the prediction from the prior parameters. We 
then compute the ratio of the errors in these predictions, 
to compare whether the prior or empirical estimate is more 
accurate and update A and N\ 

/3=|p ^—(3 = l-VoP N = uo/p, (5) 
ll^t “ xtll 

where 770 and uq are hyperparameters of the algorithm. Note 
that, in these updates, N is inversely proportional to 1-/3, 
as expected, and the proportionality constant is controlled by 
770 and z/Q, which we set as z/q = 1 and 770 = 8 in all of our 
experiments. 

C. Algorithm Summary 

The structure of our method is summarized in Algorithmic 
At each time step, the method observes the current state Xt 
and uses [xt_i, Ut-i, Xt] to update the current estimate of 


the empirical mean and covariance A and E as described in 
the previous section. The method then evaluates the prior to 
obtain /io, and tiq, and updates /3 and N based on 
the accuracy of the empirical and prior state prediction. The 
empirical and prior estimates are then combined according to 
Equation Q to construct the posterior mean and covariance, 
from which we can obtain an estimate for the current 
dynamics. These estimated dynamics are then used, together 
with a local second order expansion of the cost function, 
to optimize a new linear feedback policy using LQR. This 
feedback policy, given by g{xt) = ut + k* + Kt(xt - x*), 
can then be used to choose the next action u^. 

In practice we often want to perform a small amount of 
exploration. Eor example, if the robot is attempting a peg 
insertion task, and the peg becomes jammed in the hole, 
simply applying the estimated optimal action repeatedly may 
be insufficient. The robot must “wiggle” its arm to figure out 
the contact dynamics. To that end, we add a small amount of 
Gaussian noise to the action. The amount of noise to add is 
determined by the Q-function obtained from LQR, by setting 
the covariance to be proportional to Q”ut- This choice of 
covariance is motivated by the observation that it produces a 
maximum entropy policy that properly trades off randomness 
and cost minimization, as discussed in prior work [19]. 


V. Neural Network Dynamics Priors 


In this section, we discuss several possible choices for 
the dynamics prior, describe how the normal-inverse-Wishart 
prior parameters can be obtained from these priors, and go 
into detail on the particular neural network prior that we use 
in this work. The various choices for the priors are compared 


in our experimental evaluation in Section VI 


A. Gaussian and Gaussian Mixture Priors 

As discussed in the previous section, our online estimate of 
the dynamics linearization makes use of a dynamics prior. 
The simplest prior can be obtained by fitting a Gaussian 
distribution to vectors [x; u; x'] obtained from a large batch 
of prior data. If the mean and covariance of this prior data 
are given by ft and S, the prior is given by ^ = tt^qS and 
fiQ = p, while no and m should be set to the number of 
data points in the prior datasets. In practice, setting no and 
777 to 1 tends to produce substantially better results, since the 
online empirical mean and covariance are typically obtained 
from a much smaller number of samples. 

A more sophisticated choice of prior explored in previous 
work is a Gaussian mixture model over vectors [x;u;x'], 
which allows modeling of nonlinear dynamics [17]. Under 
this model, the state transition tuple is assumed to come from 
a distribution that depends on some hidden state hi, which 
corresponds to the mixture element identity. In practice, 
this hidden state might correspond to the type of contact 
profile experienced by a robotic arm at step i. The prior is 
obtained by inferring the hidden state hi for the latest tuple 
[x^_i; Ui_i; xA, and using the mean and covariance of the 
corresponding mixture element to obtain p and S. The prior 
parameters can then be obtained as described above. 










B. Neural Network Priors 


The Gaussian prior is limited to representing globally 
linear dynamics, while the Gaussian mixture model can only 
represent a small number of locally linear modes, and has 
limited representational capacity in complex state spaces. 
Recent work has shown that neural networks can learn very 
good dynamics models for tasks ranging from helicopter 
flight [20] to cutting vegetables [12]. We therefore also 
evaluated the use of neural networks for representing the 
dynamics prior, by training a neural network to map from 
state-action tuples [x; u] to the next state x'. 

In order to construct a dynamics prior from a neural 
network /([x; u]) at the current state and action [x^; u^], we 
first linearize to obtain 

- T 

/([x; u]) /([xj; Ui]) + ([x; u] - [x^; Uj]). 

From this linearization, we can construct a local prior mean 
and covariance according to 


M = 


S = 


/([xi;uj]) 


-'XU,XU 

df 


T_ 
df Y 
d[x;u] ^xu,xu 


- _ 

y aj df Y _ 

^xu,xu^[x;u] d[x;u] ^xu,xu^[x; 


df 


The prior state-action covariance Sxu,xu determines the 
strength of the prior, and we set it to al, where a is a 
free parameter. The conditional covariance Sx',x' can be 
obtained from the empirical covariance between the network 
predictions /([x;u]) and the target next states x' in the 
training dataset. Using the above p and S, the normal- 
inverse-Wishart prior parameters can be obtained in the same 
way for the Gaussian and mixture of Gaussians priors. 


C. Neural Network Architectures 

In order to represent the dynamics prior, we used a 
neural networks with two hidden layers, each with rectified 
linear units given by 2 ; = max(a,0). The first hidden layer 
consists of 60 hidden units, and the second consists of 40. 
We evaluated two network architectures, both of which are 
illustrated in Figure The first takes the current state x^ 
and current action Uf as input, and predicts accelerations. 
These accelerations are then integrated via a semi-implicit 
integration rule to obtain a prediction for x^+i, and the 
network is optimized to minimize the error in predicting the 
entire next state x^+i. 

Our second architecture is identical to the first, but instead 
takes as input the current and previous states and actions 
(xt_i, Ut-i, Xt, Ut). This additional temporal context is very 
important when the dynamics prior is trained on other 
tasks, especially tasks that involve contacts with objects 
in the environment. Since the state x^ only includes the 
configuration of the robot, it does not model the geometry 
of the scene, which might change across tasks. The state is 
therefore not Markovian across tasks, and a network without 
additional context cannot accurately predict the next state 
xt+i from only the current state x^ and action Uf. We found 



Contextual Network 


Fig. 2: Diagram of the neural network architectures used in our 
experiments. We found that using a short temporal context as input, 
as shown in network (2), improved the results for manipulation tasks 
that involved contact dynamics. Both networks produce accelera¬ 
tions which are used to predict the next state. 

that the network without temporal context tried to explain 
contacts by exploiting certain regularities. For example, when 
the end-effector stopped after hitting an object, the network 
assumed that a hard contact had occurred and there is an 
impassable obstruction in the way. With a temporal context 
that included the previous state and action, the network was 
able to determine whether or not a contact was happening 
by comparing the previous applied joint torques to the 
acceleration actually experienced by the robot. 

D. Prior Training Data 

The neural network prior must be trained on previous 
interaction data in order to provide a helpful prior model of 
the system dynamics. However, because the neural network 
only acts as a prior, it can be trained on previous interaction 
data from different tasks. This allows our method to perform 
one-shot learning of new manipulation tasks, using no prior 
data from that task itself. In practice, the amount of data 
required to learn an effective prior is considerable, so we 
used data collected from a variety of sources to provide 
a sufficiently diverse dataset. The total training set for the 
physical robot experiments had 6.6 hours of data, collected 
at 20 Hz. For each individual task, we excluded the data 
collected for that task when training the prior, which reduced 
the effective dataset by about 15%. This corresponds to a 
type of holdout cross-validation. Likewise, in the simulated 
experiments, we collected approximately 12 simulation hours 
of data at 20Hz across 4 different tasks, and excluded the data 
from the task being executed from the training set. 

The dataset consisted of trials from the various evaluation 
tasks (workbench, gears, airplane, car, and ring tasks for 
the physical system and peg insertion and stacking for the 
simulated system), which are described in the next section, 
as well as random motion of the arm in free space. For 
each task, data was collected using a previous reinforcement 
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Fig. 3: Simulated tasks used for evaluation: (a) Cylindrical peg 
insertion (b) Cross-shaped peg insertion (c) Stacking over a cylin¬ 
drical peg (d) Stacking over a square-shaped peg 

learning method [21], though in practice this data could 
have also been generated using prior experience from the 
method presented in this paper, human demonstrations, or 
any other method that provides good coverage of interesting 
states (e.g. states that involve contact with objects in the 
world). Although the size of this dataset is considerable, the 
resulting prior, when combined with our online adaptation 
method, can generalize effectively to other tasks, making it 
quite universal. 

VI. Experimental Evaluation 
A. Experimental Setup 

We evaluated our method on a range of manipulation tasks 
using a physical PR2 mobile manipulator and a simulated 
PR2 arm using the MuJoCo simulator [11]. The simulated 
tasks provide a large-scale comparison between methods, and 
the physical tasks demonstrate effectiveness on a real system. 

On the physical system, our method was used to control 
the right arm of the robot, while the left arm was used 
to brace objects for two-object manipulation and assembly 
tasks. We evaluated the method by inserting a toy nail into 
a tool bench, including a high-friction version of the task 
designed to make it more difficult, placing wooden rings onto 
pegs, putting together parts of a gear assembly, assembling 
a toy car and airplane, and stacking blocks. These tasks are 
shown in Eigure]^ Our simulated tasks consisted of a pair of 
peg insertion tasks with different shaped pegs, and a pair of 
peg stacking tasks, where the robot needs to fit a piece with 
a square-shaped opening over a cylindrical or square shaped 
peg. These are shown in Eigure A trial was defined to be 
a failure if the distance to a target pose was not sufficiently 
small at the end of a time threshold of 10 seconds. 

The state space consisted of several measurements from 
the active right arm of the robot: joint angles, joint angular 
velocities, the pose of the end effector encoded as 3 Cartesian 
points, and the velocities of those 3 points. This amounted 
to a 32-dimensional state space. The controller operated at 
20 Hz with an MFC horizon of 15 timesteps (0.75 seconds). 

The cost function for each task depended on the distance 
d between the current pose of the end-effector and a target 
pose required for successful completion. Eor example, in the 
case of the toy nail task, the target pose involved positioning 


the gripper such that the nail was successfully inserted into 
a hole in the toy tool bench. The shape of this cost follows 
prior work [21], and is given by ri{d) = wd? log{d‘^^a), 
where a = 10“^, w = 1.0, and v = 0.01. This type of 
shape encourages the robot to accurately place the object in 
precisely the desired configuration. In addition, we placed a 
quadratic penalty on the magnitude of the torques. 

As described in the previous section, the neural network 
dynamics prior was trained using data collected from all of 
the other tasks, but no data from the task being tested. This 
corresponds to a hold-one-out cross-validation scheme. 


B. Comparisons 


We evaluated our adaptive method with two neural net¬ 
work priors: network #1 in Eigure which uses only the 
current state Xt and action Ut to predict the next state 
Xt+i, and network #2, which also uses the preceding state 
and action. In addition, we also compared to a number of 
baselines. The non-adaptive baseline used network #2, which 
was the better of the two neural network priors, to directly 
plan actions using MFC, which most closely resembles the 
structure of a previously proposed neural network-based 
MFC method [12]. The Gaussian process baselines used a 
Gaussian process either as a prior for the adaptive approach, 
or as the model (without adaption) for MFC, which most 
closely reflects a range of recent Gaussian process model- 
based reinforcement learning algorithms [7], [22]. Since 
the Gaussian process is a nonparametric model, we were 
unable to use the same amount of data with this model and 
maintain real time performance. We therefore subsampled 
the data to the largest size that still permitted online MFC 
(approximately 10000 training points, which was about 5% 
of the total training set for the physical robot). Finally, the 
regularized model substituted a global linear model instead 
of the neural network dynamics prior. 


We show the success rates for each method on the simu¬ 
lated robot in Table [T| and on the physical robot in Tables |n| 
(for the standard and high-friction variant of the toolbench 
and toy nail task, as well as for the wooden ring on a 
peg task) and HI (for the block, gears, airplane, and car 
assembly tasks). Note that each of the runs used to evaluate 
each method was performed separately, with no information 
retained between runs. The adaptive method outperformed 
the non-adaptive variants, indicating the importance of online 
adaptation for models trained on other tasks. The neural 
network prior also achieved the best overall performance, 
particularly when using the preceding state and action as 
context, although even the simple least-squares prior was 
able to accomplish some of the simpler tasks. 


We also show the success rates for our adaptive method 
with network #2 on each of the other tasks. These results 
show that our method was able to succeed on a wide range 
of challenging manipulation tasks on the first try, without 
using any data from that task to train the dynamics prior. 






Fig. 4: Tasks used in our evaluation: (a) inserting a toy nail into a toolbench, (b) inserting the nail with a high-friction surface to increase 
difficulty, (c) placing a wooden ring on a tight-fitting peg, (d) stacking toy blocks, (e) putting together part of a gear assembly, (f) 
assembling a toy airplane and (g) a toy car. 


Task 

Ins, Cylinder (a) 

Ins, Cross (b) 

Stack, Cylinder (c) 

Stack, Square (d) 

Total 

Adaptation + regularization 

10/25 

5/25 

18/25 

10/25 

43/100 

Adaptation + GMM prior 

10/25 

6/25 

25/25 

18/25 

59/100 

Contextual network (#2) 

8/25 

1/25 

19/25 

12/25 

40/100 

Adaptation + network #1 

14/25 

8/25 

20/25 

14/25 

56/100 

Adaptation + Contextual network (#2) (full method) 

19/25 

11/25 

25/25 

20/25 

75/100 


TABLE I: Success rate of our method on simulated tasks, as well as comparisons to a range of baselines representative of prior methods 
on three of the tasks. 


Task 

Toolbench (a) 

W/ friction (b) 

Ring (c) 

Total 

Adaptation + regularization 

2/5 

0/5 

4/5 

6/15 

Adaptation + GMM prior 

0/5 

1/5 

2/5 

3/15 

Gaussian process 

0/5 

0/5 

0/5 

0/15 

Adaptation + gaussian process 

1/5 

0/5 

0/5 

1/15 

Contextual network (#2) 

3/5 

3/5 

4/5 

10/15 

Adaptation + network #1 

4/5 

0/5 

3/5 

7/15 

Adaptation + Contextual network (#2) (full method) 

5/5 

4/5 

4/5 

13/15 


TABLE II: Success rate of our method on each physical task. Note that our approach regularly 
succeeds at the task despite only using training data from other tasks. 


Task 

Success Rate 

Block (d) 

4/5 

Gears (e) 

4/5 

Airplane (f) 

3/5 

Car (g) 

3/5 


TABLE III: Success rates on addi¬ 
tional tasks using adaptation with 
network #2 only. 


Task 

Block 

Toolbench 

Ring 

Target position error 

0.5 cm 

1.0 cm 

1.5 cm 

0.5 cm 

1.0 cm 

1.5 cm 

0.5 cm 

1.0 cm 

1.5 cm 

Contextual network (#2) 

2/5 

0/5 

0/5 

4/5 

2/5 

0/5 

4/5 

4/5 

1/5 

Adaptation + contextual network (#2) (full method) 

4/5 

1/5 

0/5 

3/5 

3/5 

1/5 

4/5 

3/5 

0/5 


TABLE IV: Robustness results in the presence of target position errors. The target was intentionally offset from the true position of the 
target by the indicated amount in random direction along the horizontal plane. 


C Robustness 


We also evaluated the robustness of our method to obser¬ 
vation errors. In these experiments, the target position in the 
cost function was intentionally corrupted by fixed magnitude 
errors. In real-world scenarios, these errors might stem from 
imperfect observations produced, for example, by a vision 
system. The results of the robustness experiments on two of 


the tasks are presented in Table IV For comparison, we also 
include robustness results for the neural network only (using 
network #2), without adaptation. The ring was the easiest of 
the three tasks, since the rounded top of the peg provides 
some tolerance, while the block was the hardest. Note that 
adaptation is particularly helpful on the harder tasks. 


D. Qualitative Results and Conclusions 

Our experimental results show that our model-based 
reinforcement learning algorithm with online adaptation 
can achieve a range of challenging manipulation tasks 
on the first attempt. Furthermore, our comparison of the 
various prior models shows that a neural network prior 


with a short temporal context achieved the best results, 
though no prior model by itself was as successful as 
the corresponding adaptive variant. Videos of our tasks, 
which can be viewed in the supplementary material and 
on the project website, (http://rll.berkeley.edu/ 
iros201 Gonlinecontrol/index. html ) show that 
the resulting behaviors can take some time to succeed at the 
task (though never more than 20 seconds). Much of this time 
is spent exploring unfamiliar dynamical modes, for example 
after an unexpected contact. This kind of exploration reflects 
a very natural strategy for performing unfamiliar tasks: 
probing the object until progress is made in the desired 
direction, while gradually acquiring a more accurate model. 
It is precisely this probing and gradual self-improvement that 
is key to the robustness and effectiveness of this approach. 
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VIL Discussion and Future Work 

In this paper, we presented a model-based reinforcement 
learning algorithm that uses online adaptation of its dynamics 
model combined with a coarse prior model obtained from 
experience on other prior tasks. We use an MFC method 
based on iterative LQR to choose the actions under the 
current adapted model, which is a local linear approximation 
to the nonlinear dynamics of the system. Although this linear 
model is simple, our method is able to perform complex 
manipulation tasks in highly nonlinear systems by adapting 
it to the most recent experience and re-planning. The coarse 
prior model is therefore not required to accurately model the 
true dynamics of the system, and only provide a guess that 
is refined through online adaptation. Our experiments show 
that even a fully linear prior model can be used for some 
behaviors, while a more sophisticated neural network prior 
allows our method to complete more complex tasks. 

We believe that combining online model adaptation with 
prior experience by means of a dynamics prior is a powerful 
idea because it allows our method to succeed even with unex¬ 
pected environmental variation and inaccurate prior models. 
Furthermore, by using rich function approximators such as 
large neural networks to distill prior experience into a concise 
parametric representation, the robot can progressively be¬ 
come more proficient at acquiring new skills as its experience 
grows, similar to how a person can quickly learn new skills 
by drawing on past experiences. However, in contrast to 
humans, robots can pool their collective experience and use 
the combined data to train shared dynamics priors. Exploring 
this type of multi-robot learning, with shared priors but 
individual adaptation, is an exciting direction for future work. 

While in our evaluations, we aggregated data from all 
other tasks to train the dynamics model, we may want to 
only use related tasks to avoid negative transfer. This may 
be accomplished by clustering or grouping the prior tasks. 

Although we demonstrate that our approach can complete 
a variety of challenging manipulation tasks, a limitation 
of this method is that we require a full, Markovian state 
of the system, which is needed to perform MFC using 
iterative LQR. This assumption is not unusual for MFC- 
based methods, but can be limiting in the context of robotic 
manipulation. One approach for addressing this is to learn 
a latent state representation from raw sensory input, as pro¬ 
posed in recent work [23], [24], [25], and perform MFC on 
this learned representation with online dynamics adaptation. 

Another avenue for future work is to combine other kinds 
of prior information with online updates. For example, a prior 
policy might be constructed from experience of related tasks, 
and refined in a similar fashion as the dynamics adaptation. 
This could allow our relatively short-horizon MFC procedure 
to acquire longer lookahead through the use of the prior 
policy, and allow the use of rich sensory data as demonstrated 
in recent work on policy search with neural networks [26]. 
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